Exposing Inner Kernels and Block Storage for Fast Parallel Dense Linear Algebra Codes

Author

  • José R. Herrero
Abstract

Efficient execution on processors with multiple cores requires the exploitation of parallelism within the processor. For many dense linear algebra codes this, in turn, requires the efficient execution of code that operates on relatively small matrices. Efficient implementations of the dense Basic Linear Algebra Subroutines (BLAS) exist; however, calls to BLAS libraries introduce large overheads when they operate on small matrices. High performance implementations of parallel dense linear algebra codes can be achieved by replacing calls to standard BLAS libraries with calls to specialized inner kernels that work efficiently on small data submatrices.
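As a rough illustration of the replacement described in the abstract, the C sketch below shows the kind of specialized inner kernel that can stand in for a general GEMM call on small blocks; the block size NB, the kernel name, and the contiguous column-major layout are assumptions for illustration, not the paper's actual kernel.

/* C += A * B for NB x NB column-major blocks stored contiguously.
 * Because NB is a compile-time constant, the compiler can fully unroll
 * and vectorize the loops, and there is none of the per-call argument
 * checking and dispatch overhead of a general BLAS dgemm call. */
#define NB 32  /* hypothetical block size */

static void inner_gemm_nb(const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; j++)
        for (int k = 0; k < NB; k++) {
            const double b = B[k + j * NB];
            for (int i = 0; i < NB; i++)
                C[i + j * NB] += A[i + k * NB] * b;
        }
}

The point is not the loop nest itself but the fixed dimensions: a dgemm call with m = n = k = 32 spends a significant fraction of its time in argument validation and blocking logic that this kernel simply does not have.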


Similar resources

New Data Structures for Matrices and Specialized Inner Kernels: Low Overhead for High Performance

Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated with such calls. Taking as an example the dense Cholesky factorization of a symmetric positive definite matrix, we show that the potential of non-canonical data structures for dense linear algebra can be better exploited with the u...

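A minimal sketch of what such a non-canonical (block-major) layout can look like, assuming square NB x NB tiles stored contiguously; the names and the tile ordering are illustrative, not necessarily the structure used in the paper.

#include <stddef.h>

/* The n x n matrix is stored as a grid of contiguous NB x NB tiles,
 * ordered column of tiles by column of tiles, so each inner-kernel
 * invocation during a blocked Cholesky factorization reads and writes
 * one contiguous block instead of strided columns. n is assumed to be
 * a multiple of NB. */
#define NB 32

static double *tile(double *A, int n, int bi, int bj)
{
    int tiles_per_col = n / NB;  /* number of tile rows in the grid */
    return A + ((size_t)bj * tiles_per_col + bi) * (NB * NB);
}

With this layout, each block update inside the factorization operates on contiguous NB*NB buffers, which is exactly the situation specialized inner kernels like the one sketched earlier are built for.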

Recursion-based parallelization of exact dense linear algebra routines for Gaussian elimination

We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. Contrary to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms making coarse...

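The recursive structure being described can be sketched as follows. This is an unpivoted, cubic stand-in written for clarity; in the sub-cubic setting of that paper the Schur-complement update would use fast matrix arithmetic, and the independent update blocks would become the coarse-grain parallel tasks. All names here are illustrative.

#include <stddef.h>

#define AT(M, i, j, ld) ((M)[(i) + (size_t)(j) * (ld)])

/* Base case: right-looking unpivoted LU on a small column-major block. */
static void lu_base(double *A, int n, int ld)
{
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++) {
            AT(A, i, k, ld) /= AT(A, k, k, ld);
            for (int j = k + 1; j < n; j++)
                AT(A, i, j, ld) -= AT(A, i, k, ld) * AT(A, k, j, ld);
        }
}

/* Recursive LU: split into a 2 x 2 block structure and recurse. */
void lu_rec(double *A, int n, int ld)
{
    if (n <= 32) { lu_base(A, n, ld); return; }
    int n1 = n / 2, n2 = n - n1;
    double *A12 = &AT(A, 0, n1, ld), *A21 = &AT(A, n1, 0, ld);
    double *A22 = &AT(A, n1, n1, ld);

    lu_rec(A, n1, ld);                        /* A11 = L11 * U11 */

    for (int j = 0; j < n1; j++)              /* A21 <- A21 * U11^{-1} */
        for (int i = 0; i < n2; i++) {
            double s = AT(A21, i, j, ld);
            for (int k = 0; k < j; k++)
                s -= AT(A21, i, k, ld) * AT(A, k, j, ld);
            AT(A21, i, j, ld) = s / AT(A, j, j, ld);
        }

    for (int j = 0; j < n2; j++)              /* A12 <- L11^{-1} * A12 */
        for (int i = 0; i < n1; i++)
            for (int k = 0; k < i; k++)
                AT(A12, i, j, ld) -= AT(A, i, k, ld) * AT(A12, k, j, ld);

    for (int j = 0; j < n2; j++)              /* A22 <- A22 - A21 * A12 */
        for (int k = 0; k < n1; k++)
            for (int i = 0; i < n2; i++)
                AT(A22, i, j, ld) -= AT(A21, i, k, ld) * AT(A12, k, j, ld);

    lu_rec(A22, n2, ld);                      /* factor the Schur complement */
}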

Automatic Generation of Block-Recursive Codes

Block recursive codes for dense numerical linear algebra computations appear to be well suited for execution on machines with deep memory hierarchies because they are effectively blocked for all levels of the hierarchy. In this paper we describe compiler technology to translate iterative versions of a number of numerical kernels into block recursive form. We also study the cache behavior and perf...

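The transformation has the following flavor, shown here on matrix multiplication (a hypothetical example; the paper's compiler handles a range of kernels). The iterative loop nest and the block-recursive form compute the same product, but the recursive version is blocked for every level of the memory hierarchy at once.

#include <stddef.h>

#define EL(M, i, j, ld) ((M)[(i) + (size_t)(j) * (ld)])

/* Iterative form: a plain triple loop over column-major matrices. */
void mm_iter(const double *A, const double *B, double *C, int n, int ld)
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                EL(C, i, j, ld) += EL(A, i, k, ld) * EL(B, k, j, ld);
}

/* Block-recursive form: split into 2 x 2 quadrants and recurse
 * (n is assumed to be a power of two for brevity). */
void mm_rec(const double *A, const double *B, double *C, int n, int ld)
{
    if (n <= 16) { mm_iter(A, B, C, n, ld); return; }
    int h = n / 2;
    const double *A11 = A, *A12 = &EL(A, 0, h, ld);
    const double *A21 = &EL(A, h, 0, ld), *A22 = &EL(A, h, h, ld);
    const double *B11 = B, *B12 = &EL(B, 0, h, ld);
    const double *B21 = &EL(B, h, 0, ld), *B22 = &EL(B, h, h, ld);
    double *C11 = C, *C12 = &EL(C, 0, h, ld);
    double *C21 = &EL(C, h, 0, ld), *C22 = &EL(C, h, h, ld);

    mm_rec(A11, B11, C11, h, ld); mm_rec(A12, B21, C11, h, ld);
    mm_rec(A11, B12, C12, h, ld); mm_rec(A12, B22, C12, h, ld);
    mm_rec(A21, B11, C21, h, ld); mm_rec(A22, B21, C21, h, ld);
    mm_rec(A21, B12, C22, h, ld); mm_rec(A22, B22, C22, h, ld);
}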

Compiler-Optimized Kernels: An Efficient Alternative to Hand-Coded Inner Kernels

The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. In this paper, however, we present an alternative way to produce efficient matrix multiplication kernels based on a set of simple codes which can be parameterized at compilation time. Using the resulting kernels we have been able to produce ...

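A plausible shape for such a parameterized code is sketched below; the macro names, packing convention, and register-block sizes are assumptions. The same simple source is compiled once per parameter choice, e.g. with cc -O3 -DMR=4 -DNR=4 -DKB=256, and the compiler does the unrolling and vectorization that would otherwise be hand-coded.

/* C (MR x NR, column-major, leading dimension ldc) += A * B, where A is
 * packed as an MR x KB panel (column-major) and B as a KB x NR panel.
 * With MR, NR and KB fixed at compilation time, the accumulator array
 * can live in registers and every loop bound is a constant. */
#ifndef MR
#define MR 4
#endif
#ifndef NR
#define NR 4
#endif
#ifndef KB
#define KB 256
#endif

void micro_kernel(const double *A, const double *B, double *C, int ldc)
{
    double acc[MR][NR] = {{0.0}};
    for (int k = 0; k < KB; k++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                acc[i][j] += A[i + k * MR] * B[k + j * KB];
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            C[i + j * ldc] += acc[i][j];
}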

Prototyping Parallel LAPACK using Block-Cyclic Distributed BLAS

Given an implementation of Distributed BLAS Level 3 kernels, the parallelization of dense linear algebra libraries such as LAPACK can be easily achieved. In this paper, we briefly describe the implementation and performance on the AP1000 of Distributed BLAS Level 3 for the rectangular r × s block-cyclic matrix distribution. Then, the parallelization of the central matrix factorization and the trid...

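For reference, an r × s block-cyclic mapping assigns global element (i, j) to the process grid as sketched below; this is a minimal illustration following the usual ScaLAPACK-style convention, and the names are not from the paper.

/* On a P x Q process grid, the matrix is cut into r x s blocks and the
 * blocks are dealt out cyclically: block (i / r, j / s) lands on
 * process ((i / r) mod P, (j / s) mod Q). */
typedef struct { int prow, pcol; } owner_t;

owner_t block_cyclic_owner(int i, int j, int r, int s, int P, int Q)
{
    owner_t o;
    o.prow = (i / r) % P;   /* process row owning global row i */
    o.pcol = (j / s) % Q;   /* process column owning global column j */
    return o;
}

/* Local row index of global row i within its owning process row. */
int local_row(int i, int r, int P)
{
    return (i / (r * P)) * r + (i % r);
}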


Journal title:

Volume:  Issue:

Pages:  -

Publication date: 2008